Summarization of sumo video content

ABSTRACT

Summarization of video content including sumo.

BACKGROUND OF THE INVENTION

The present invention relates to summarization of video content including sumo.

The amount of video content is expanding at an ever increasing rate, some of which includes sporting events. Simultaneously, the available time for viewers to consume or otherwise view all of the desirable video content is decreasing. With the increased amount of video content coupled with the decreasing time available to view the video content, it becomes increasingly problematic for viewers to view all of the potentially desirable content in its entirety. Accordingly, viewers are increasingly selective regarding the video content that they select to view. To accommodate viewer demands, techniques have been developed to provide a summarization of the video representative in some manner of the entire video. Video summarization likewise facilitates additional features including browsing, filtering, indexing, retrieval, etc. The typical purpose for creating a video summarization is to obtain a compact representation of the original video for subsequent viewing.

There are two major approaches to video summarization. The first approach for video summarization is key frame detection. Key frame detection includes mechanisms that process low level characteristics of the video, such as its color distribution, to determine those particular isolated frames that are most representative of particular portions of the video. For example, a key frame summarization of a video may contain only a few isolated key frames which potentially highlight the most important events in the video. Thus some limited information about the video can be inferred from the selection of key frames. Key frame techniques are especially suitable for indexing video content but are not especially suitable for summarizing sporting content.

The second approach for video summarization is directed at detecting events that are important for the particular video content. Such techniques normally include a definition and model of anticipated events of particular importance for a particular type of content. The video summarization may consist of many video segments, each of which is a continuous portion in the original video, allowing some detailed information from the video to be viewed by the user in a time effective manner. Such techniques are especially suitable for the efficient consumption of the content of a video by browsing only its summary. Such approaches facilitate what is sometimes referred to as “semantic summaries”.

What is desired, therefore, is a video summarization technique suitable for video content that includes sumo.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary flowchart for play detection.

FIG. 2 is an exemplary illustration of a pre-bout scene in sumo.

FIG. 3 is a technique for detecting a start frame of a sumo “play.”

FIG. 4 is a pre-bout scene in sumo.

FIG. 5 illustrates the skin color and ring color of FIG. 4.

FIG. 6 illustrates binarized skin color of FIG. 5.

FIG. 7 is a horizontal projection of FIG. 6.

FIG. 8 is a vertical projection of FIG. 6.

FIGS. 9A-9C are a series of sequential images in a video clip showing two sumo contestants colliding.

FIG. 10 is an illustration of temporal evidence accumulation.

FIG. 11 is an illustration of color histogram differences.

FIG. 12 is an illustration of absolute pixel-to-pixel differences in the luminance domain.

FIG. 13 illustrates scene cut detection.

FIG. 14 illustrates names in a sumo video.

FIGS. 15A-15C illustrate audio segments of different plays.

FIG. 16 illustrates forming a multi-layered summary of the original video sequence.

FIG. 17 illustrates the video summarization module as part of a media browser and/or a service application.

FIG. 18 illustrates a video processing system.

FIG. 19 illustrates an exemplary overall structure of the sumo summarization system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Sumo, the national sport of Japan, is tremendously popular in eastern Asia and is growing in popularity elsewhere in the world. Sumo is a sport comprising bouts in which two contestants meet in a circular ring 4.55 meters in diameter. The rules of sumo are uncomplicated. After the contestants and a referee have entered the circular ring, the bout begins with an initial charge, called a “tachiai,” where each contestant rushes towards, then collides with, the other. The bout ends when one of the contestants loses by either stepping outside the circular ring or touching the ground with any part of the contestant's body other than the soles of the feet. Aside from a limited number of illegal moves, such as gouging the opponent's eyes, striking with a closed fist, or intentionally pulling at the opponent's hair, there are no rules that govern a sumo bout.

Sumo participants may compete against one another in one of a number of tournaments. Japan sponsors six sanctioned Grand Sumo tournaments, held in odd-numbered months throughout the year, in which competitive sumo contestants face one another with the opportunity for advancement in rank. Sumo contestants are ranked under a strict meritocracy; winning bouts in these sanctioned tournaments improves a competitor's rank while losing bouts diminishes that rank. Aside from the six sanctioned tournaments, a number of exhibition tournaments, called Jungyo, are scheduled throughout the year.

Though a sumo tournament will typically take place over several weeks with bouts scheduled throughout each day, most bouts of interest, i.e., those involving higher ranked contestants, are scheduled to begin in the late afternoon when live television broadcasts of the tournament occur. These portions of the sumo tournaments usually last 2-3 hours each day and are often video recorded for later distribution or for re-broadcast.

Though such a video of a sumo tournament might typically last about 2-3 hours, only about ten minutes turns out to include time during which two players are in a bout. An individual sumo bout is brief; the typical bout will end with the initial collision, though a rare bout might last two to three minutes. Interspersed between bouts are a large number of ceremonies that precede and follow each bout.

Though brief, the time intervals during which a bout is proceeding are intense and can captivate those in the viewing audience, many of whom are able to identify a myriad of named sumo techniques that may occur in rapid succession. Such techniques include a “kekaeshi” (a foot-sweep), a “kubinage” (a head-lock throw), and an “izori” (a technique where a contestant crouches below the opponent's rush, grabbing one of the opponent's legs, lifting the opponent upon the shoulders and falling backwards), as well as some sixty-five to seventy more named sumo techniques or occurrences.

The remaining time during the sumo tournament is typically not exciting to watch on video. Such time would include, for example, inter-bout changes of players, pre-bout exercises and ceremonies, post-bout ceremonies and, in the case of a broadcast, nearly endless commercials. While it may indeed be entertaining to sit in an arena for several hours for a sumo tournament, many people who watch a video of a sumo tournament find it difficult to watch all of the tournament, even if they are rabid fans. Further, the tournaments are held during daytime hours, hence many fans are unable to attend a tournament or to watch a live broadcast due to work. Such fans may nonetheless be interested in watching specific bouts or some other condensed version of the tournament. Thus a video summarization of the sumo tournament that provides a summary of the tournament having a duration shorter than the original sumo video may be appealing to many people. The video summarization should provide nearly the same level of excitement (e.g., interest) that the original tournament provided.

Upon initial consideration, sumo would not appear to be a suitable candidate for automated video summarization. Initially, there is a nearly endless number of potential moves that may occur that would need to be accounted for in some manner. In addition, each of these moves may involve significant player motion that is difficult to anticipate, difficult to track, and is not consistent between plays. In addition, the players are flesh toned and the ring is likewise generally flesh toned, making identification of the events difficult. Based upon such considerations it has been previously considered impractical, if not impossible, to attempt to summarize sumo.

It is conceivably possible to develop highly sophisticated models of a typical sumo video to identify potentially relevant portions of the video. However, such highly sophisticated models are difficult to create and are not normally robust. Further, the likelihood that a majority of the highly relevant portions of the sumo video will be included in such a video summarization is low because of the selectivity of the model. Thus the resulting video summarization of the sumo tournament may simply be unsatisfactory to the average viewer.

After consideration of the difficulty of developing highly sophisticated models of a sumo video to analyze the content of the sumo video, as the sole basis upon which to create a sumo summarization, the present inventors determined that this technique is ultimately flawed as the models will likely never be sufficiently robust to detect all the desirable content. Moreover, the number of different types of model sequences of potentially desirable content is difficult to quantify. In contrast to attempting to detect particular model sequences, the present inventors determined that the desirable segments of the sumo match are preferably selected based upon a “play”. A “play” may be defined as a sequence of events defined by the rules of sumo. In particular, and in one aspect, the sequence of events of a “play” may generally include the time between which the players line up to charge one another and one player loses the bout by either stepping outside the sumo ring or touching the clay surface with a part of his body other than the soles of the feet. A play may also selectively include certain pre-bout ceremonies or events, such as the time during which the contestants throw salt in the ring or stare at one another prior to charging. Normally the “play” should include a related series of activities that could potentially result in a victory by one contestant and a loss by the other contestant.

It is to be understood that the temporal bounds of a particular type of “play” do not necessarily start or end at a particular instant, but rather at a time generally coincident with the start and end of the play or otherwise based upon, at least in part, a time (e.g., event) based upon a play. For example, a “play” starting with the contestants throwing salt into the ring may include the times during which the contestants charge each other. A summarization of the video is created by including a plurality of video segments, where the summarization includes fewer frames than the original video from which the summarization was created. A summarization that includes a plurality of the plays of the sumo match provides the viewer with a shortened video sequence while permitting the viewer to still enjoy the match because most of the exciting portions of the video are provided, preferably in the same temporally sequential manner as in the original sumo video.

Referring to FIG. 1, a procedure for summarization of a sumo video includes receiving a video sequence 20 that includes at least a portion of a sumo match. Block 22 detects the start of a play of a video segment of a plurality of frames of the video. After detecting the start of the play, block 24 detects the end of the play, thereby defining a segment of video between the start of the play and the end of the play, namely, a “play”. Block 26 then checks to see if the end of the video (or the portion to be processed) has been reached. If the end of the video has not been reached, block 26 branches to block 22 to detect the next play. Alternatively, if the end of the video has been reached, then block 26 branches to the summary description 28. The summary description defines those portions of the video sequence 20 that contain the relevant segments for the video summarization. The summary description may be compliant with the MPEG-7 Summary Description Scheme or TV-Anytime Segmentation Description Scheme. A compliant media browser, such as shown in FIG. 17, may apply the summary description to the input video to provide summarized viewing of the input video without modifying it. Alternatively, the summary description may be used to edit the input video and create a separate video sequence. The summarized video sequence may comprise the selected segments, which excludes at least a portion of the original video other than the plurality of segments. Preferably, the summarized video sequence excludes all portions of the original video other than the plurality of segments.
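The loop of FIG. 1 can be sketched in a few lines of Python. This is only an illustrative outline, not the implementation of the described system: the function name `detect_plays` and the two callables `find_start` and `find_end` are hypothetical stand-ins for blocks 22 and 24, and the returned list of frame-index pairs stands in for the summary description 28.

```python
from typing import Callable, List, Optional, Tuple


def detect_plays(
    num_frames: int,
    find_start: Callable[[int], Optional[int]],
    find_end: Callable[[int], Optional[int]],
) -> List[Tuple[int, int]]:
    """Return (start, end) frame pairs, one per detected play.

    find_start(pos) is assumed to return the first start-of-play frame at or
    after pos, or None if there is none (block 22).
    find_end(start) is assumed to return the end-of-play frame for a play
    starting at start, or None if no end was found (block 24).
    """
    plays: List[Tuple[int, int]] = []
    pos = 0
    while pos < num_frames:  # block 26: stop at the end of the video
        start = find_start(pos)
        if start is None:
            break
        end = find_end(start)
        if end is None or end >= num_frames:
            end = num_frames - 1  # clamp an open-ended play to the last frame
        plays.append((start, end))
        pos = end + 1  # resume the search after the play just found
    return plays
```

Because the result holds only start and end points, it can either drive a browser that skips between segments or be used to cut a separate summary video, as described above.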

FIG. 1 is intended to show a basic procedure for obtaining such a summary, where the summary description contains only the start and end points of the detected plays. The summarization shown in FIG. 1 is primarily a low-level one, though in more complex situations it may contain other information, e.g., names of contestants. The benefit of a low-level summary is that it provides sufficient detail for people to appreciate a match from the summary. The low-level summary may then form the basis for a higher level summarization, if desired. As one example, a higher level summary can be obtained by keeping only those plays receiving loud audience acclaim, achieved by adding an audio analysis procedure. Alternatively, in combination with a caption detection/recognition module, a summary can be obtained of only those plays containing a specific contestant. A yet higher summary level may contain only key frames from the plays for indexing purposes.

One component of the summarization procedure depicted in FIG. 1 is the detection of an event, or “play.” If the start and end points of all plays are detected, then the system may string all the plays together to obtain a summary from the original video and perform some post processing to smooth the transition boundaries, such as using dissolving techniques to reduce abrupt changes between plays and smoothing the audio for better auditory effects. Further, the summary should ideally contain only those segments comprising a “play” as earlier defined, thus providing a compact representation of the original tournament; the user can spend only a few minutes watching it, yet almost all of the excitement of the original tournament can be appreciated.

One of the difficulties in the detection of a “play” in a sumo broadcast is that frames in one play may sweep a large range of color, yet all the frames belong to the same event, and form an uninterrupted video clip. Thus a generic summarization scheme that uses, for example, a color histogram as the cue for key frame detection or scene classification, may not be particularly effective. In light of such difficulties, the present inventors have developed an alternate method for detecting a “play” that is specifically tailored to sumo content.

Still referring to FIG. 1, a summary is to be obtained by first detecting the boundaries of a “play.” In a sumo bout, two contestants meet in a ring 4.55 meters across. Though they wear silk belts around their waists, the players are otherwise unclothed. There are strict rules as to where the players and the referee, called a “Gyoji,” are to stand in the moments immediately preceding the initiation of the bout. Cameras situated at fixed locations around the ring capture the sumo bout. The cameras can typically pan, tilt, and zoom. The primary camera typically is situated opposite to the side where the referee stands. Thus a bout usually starts with a scene as illustrated in FIG. 2, and the bout will almost always be broadcast in its entirety by the primary camera from this vantage. Video captured by any other camera is typically used exclusively for replays, player close-ups, or post-bout ceremonies, all of which take place after the bout has ended. This format is adhered to because the primary camera can best cover the action of the bout, which usually lasts for mere moments, making it impractical to switch camera angles during a bout.

Based on these observations, the inventors have developed a model for “play” detection. A play starts with a scene as in FIG. 2. The time between the scene cut at the end of a current play and the start of the following play is not usually exciting and can thus be excluded from a compact summary. Note that a scene like that shown in FIG. 2 is typically merely a necessary condition, not a sufficient condition. In a sumo tournament, there are many pre-bout ceremonies that result in a scene like that shown in FIG. 2 while the contestants are not yet ready to initiate the bout. Thus in order to detect the start of a play, in addition to finding a scene like that depicted in FIG. 2, the system should determine whether the scene is an immediate precursor to the start of a bout. One test would be to determine whether the contestants charge one another and collide, because that is how each bout begins. In other words, the methodology of detecting whether the start of a “play” has occurred involves locating a frame similar to that shown in FIG. 2, then applying a test to determine whether the frame immediately precedes the start of a bout.

The location of frames similar to that shown in FIG. 2 may be based on the anticipated characteristics of the image, as opposed to an actual analysis of the events depicted in the video. Under the assumption that a camera gives a typical start frame like that shown in FIG. 2, one can observe that the lower part of such a frame contains the stage in which the sumo ring is defined. The stage in the lower part of the frame is usually of fixed color and lighter than the generally dark color of the upper part of the frame. This is usually true because a sumo stage is to be constructed according to the same specifications. Further, in a sumo tournament the lights are usually focused on the stage, which tends to shroud the background in darkness. In addition, each bout is preceded by the two contestants facing one another in a symmetric position about the center of the ring, with the referee to the side and between the contestants and the primary camera facing the referee.

The color of the stage can be estimated from sample data; given a set of sample frames containing the stage, a set of parameters can give an estimate for the stage color. Detecting the players is a more difficult task. Theoretically, one could use complex methods such as those explicitly modeling the shape of a human body. To achieve fast computation, the present inventors have identified a simpler method of describing a player: a player is represented by a color blob with skin tone. Thus, assuming that an estimate for skin tone is obtained, two blobs corresponding to the two respective players could be segmented. As mentioned earlier, in a sumo broadcast there are pre-bout ceremonies that could result in frames like a start frame. To enable this type of false alarm to be eliminated, the players should be tracked after they are detected to see if they move towards each other and eventually collide with each other, as would occur at the beginning of a “play” as earlier defined.

One method for detecting the beginning of a play may proceed as shown in FIG. 3. Given a stage color description Cs and skin tone description Ck, a video frame image IM can be examined to determine whether the image represents the beginning of a “play.” The color descriptions may be, for example, a single color, a range of colors, or a set of colors, in one or more color spaces. First, the image is examined to determine if it has a dark upper portion and a lower portion dominated (25% or more, 50% or more, or 75% or more) by the color Cs+Ck. If not, then the image is determined to be a non-start frame. If yes, then the image is examined to determine whether there are two dominant (25% or more, 50% or more, or 75% or more) color blobs of color Ck, nearly symmetric to each other with respect to a generally center column (+/−20% of the width of the frame off center) of the frame. If not, then the image is determined to be a non-start frame. If yes, subsequent frames are examined to determine whether the two dominant color blobs move towards, and eventually collide with, one another. If so, the original frame image IM is determined to be a start frame; otherwise it is determined not to be a start frame. The technique may be modified to include fewer tests or additional tests, in the same or a different sequence.
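The first test in this sequence (a dark upper portion and a lower portion dominated by stage or skin color) might be sketched as follows. The sketch makes several assumptions that are not stated in the text: the frame is a list of pixel rows, the upper and lower portions are split at mid-height, the same dominance threshold is applied to the dark upper portion, and the color descriptions are supplied as hypothetical predicate functions `is_dark` and `is_stage_or_skin`.

```python
def dominated_fraction(rows, predicate):
    """Fraction of the pixels in rows for which predicate(pixel) is true."""
    total = sum(len(row) for row in rows)
    hits = sum(1 for row in rows for pixel in row if predicate(pixel))
    return hits / total if total else 0.0


def passes_layout_test(frame, is_dark, is_stage_or_skin, threshold=0.5):
    """First start-frame test: dark upper part, stage/skin-colored lower part.

    threshold corresponds to one of the dominance levels named in the text
    (0.25, 0.5 or 0.75). Splitting the frame at mid-height is an assumption
    made for this sketch only.
    """
    half = len(frame) // 2
    upper, lower = frame[:half], frame[half:]
    return (
        dominated_fraction(upper, is_dark) >= threshold
        and dominated_fraction(lower, is_stage_or_skin) >= threshold
    )
```

A frame that fails this inexpensive check can be rejected immediately, so the costlier blob and tracking tests only run on plausible candidates.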

It turns out that a difficult part of this method is segmenting the player blobs from the stage because the stage color Cs and the skin tone Ck overlap in a typical color space. It is impossible to perfectly separate skin from the stage using only color information, which means that the player detection is always imperfect and the players are usually detected as fragmented pieces. In fact, this is inevitable, considering that the players often wear belts of various non-skin tone colors. If a single blob is to be detected for each player, then an additional module must be used to group the fragmented pieces. This module may introduce additional inaccuracies, aside from the demand for additional computation.

To avoid the computational burden and potential inaccuracies of such a grouping procedure, the present inventors discovered that the foregoing method for detecting the beginning of a play may be implemented by representing and tracking the blobs through their one-dimensional projections. FIG. 4 shows a candidate image IM that is a representative start frame of a sumo “play” as earlier defined, and thus should be detected by the summarization procedure shown in FIG. 1. Given the stage color description Cs and the skin tone description Ck, the candidate image shown in FIG. 4 may be reduced to the image shown in FIG. 5, where white pixels indicate a place where there is a pixel in the candidate image corresponding to either the stage color Cs or the skin color Ck. The black pixels represent the dark background areas of the candidate image. The image may be further decomposed using skin-tone based segmentation to isolate those portions of the image corresponding to the skin color Ck. A binary image, shown in FIG. 6, may be used to represent the obtained body parts, in which a value of one at a location indicates that the pixel at that location represents skin in the original image. This binary image may be projected along vertical and horizontal axes, shown in FIGS. 7 and 8, respectively. The analysis of the blobs may be performed on those projections. The proposed projection behaves effectively like an integration process, which makes the algorithm less sensitive to imperfections in the skin/stage segmentation. Note that in these projections, small and isolated peaks have been suppressed.

Ideally, a real start frame will result in two peaks of similar size in the vertical projection, nearly symmetric about the center column of the image, as shown in FIGS. 7 and 8. The horizontal projection of the binary image may be used to check whether the two blobs are symmetric about a center column of the image. In subsequent frames, these two peaks should move closer and closer, eventually converging, as illustrated by FIGS. 9A, 9B, and 9C.
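A minimal sketch of the projection idea follows, assuming the binary skin image is a list of rows of 0/1 values. The helper names, the per-column profile, the height-based suppression of small peaks, and the `size_ratio` parameter are all illustrative assumptions; only the two-peak, similar-size, near-center symmetry test and the 20% tolerance come from the description.

```python
def column_profile(mask):
    """Sum the binary mask down each column: one value per x position."""
    return [sum(column) for column in zip(*mask)]


def suppress_small(profile, min_height):
    """Zero out low values so small, isolated peaks are ignored."""
    return [v if v >= min_height else 0 for v in profile]


def find_peaks(profile):
    """Return (first, last, mass) for each run of non-zero values."""
    runs, start = [], None
    for i, v in enumerate(profile + [0]):  # trailing 0 closes a final run
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i - 1, sum(profile[start:i])))
            start = None
    return runs


def two_symmetric_peaks(profile, tolerance=0.2, size_ratio=0.5):
    """True if the profile holds exactly two similar peaks astride the center."""
    runs = find_peaks(profile)
    if len(runs) != 2:
        return False
    (s1, e1, m1), (s2, e2, m2) = runs
    width = len(profile)
    center = (width - 1) / 2
    c1, c2 = (s1 + e1) / 2, (s2 + e2) / 2
    similar = min(m1, m2) / max(m1, m2) >= size_ratio
    centered = abs((c1 + c2) / 2 - center) <= tolerance * width
    return c1 < center < c2 and similar and centered
```

Tracking the distance between the two peak centers over subsequent frames then gives a simple way to test whether the blobs converge, without ever grouping the fragmented skin regions into explicit player shapes.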

The foregoing method relies mainly on color cues, and prior knowledge about the stage color Cs and the skin tone Ck is assumed. However, it is also possible to calibrate the colors for a specific bout or tournament. With other inputs, such as a human operator's interactions, the calibration is straightforward. Without any human interaction, statistical models can still be used to calibrate the color. If a series of start scene candidates has been obtained, statistical outliers in this set can be detected with prior coarse knowledge about Cs and Ck. The remaining candidate frames can then be used to estimate the specifics of the colors. With the colors calibrated, the start-of-play detection can be performed more accurately.

The foregoing method is able to detect start frames successfully in most situations. However, if the detection of a start frame is declared after finding only one candidate frame, then the method may be susceptible to false positives. By examining a set of consecutive frames (or other temporally related frames) and accumulating evidence, the system can reduce the false-positive rate. Referring to FIG. 10, the following approach may be used to achieve temporal evidence accumulation: when detecting the start of a “play”, a sliding window of width w is used (e.g., w frames are considered at the same time). A start is declared only if more than p out of the w frames in the current window are determined to be start scene candidates, as previously described. A suitable value of p is such that p/w=70%. Other statistical measures may be used over a fixed number of frames or a dynamic number of frames to more accurately determine start scenes.
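The sliding-window rule can be sketched as below. The function name and the representation of per-frame decisions as a list of 0/1 flags are assumptions for illustration; w and p are the window width and vote count from the text, with the defaults chosen so that p/w is 70%.

```python
def first_declared_start(candidate_flags, w=10, p=7):
    """Return the index of the first window that declares a start, else None.

    candidate_flags[i] is 1 if frame i passed the single-frame start test,
    0 otherwise. A start is declared only when MORE than p of the w frames
    in the window are candidates.
    """
    for i in range(len(candidate_flags) - w + 1):
        if sum(candidate_flags[i:i + w]) > p:
            return i
    return None
```

For example, with five non-candidate frames followed by ten candidate frames, the window beginning at frame 3 is the first to contain more than seven candidates, so a lone spurious candidate frame can never trigger a start on its own.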

While the start of a “play” may be found according to the aforementioned method, the end of a “play” can occur in a variety of different ways due to the numerous techniques used to either force the opposing contestant to the ground or out of the ring. Image analysis techniques may be used to analyze the image content of the frames after the beginning of a bout to attempt to determine what occurred, but with the nearly endless possibilities and the difficulty of interpreting the content of the frames, this technique is, at the least, extremely difficult and computationally intensive. In contrast to attempting to analyze the content of the subsequent frames of a potential play, the present inventors determined that a more efficient manner for the determination of the extent of a play in sumo is to base the end of the play on camera activities. After analysis of a sumo video the present inventors were surprised to determine that the approximate end of a play may be modeled by scene changes, normally as a result of switching to a different camera or a different camera angle. The different camera or different camera angle may be modeled by determining the amount of change between the current frame (or set of frames) and the next frame (or set of frames).

Referring to FIG. 11, a model of the amount of change between frames using a color histogram difference technique for an exemplary 1,000 frame sumo video clip is shown. The peaks typically correspond to scene cuts. Unfortunately, as FIG. 11 demonstrates, for some scene cuts, like the one depicted at around frame 325, the camera break produces a relatively low peak in the color histogram difference curve, causing potential failure in scene cut detection.

To solve this problem, the inventors have discovered that the use of color histogram differences in conjunction with the sum of absolute pixel-to-pixel differences in the luminance domain is more effective when detecting a scene cut. To gain robustness in using the sum of absolute pixel-to-pixel differences, the luminance images are first down-sampled, or smoothed. FIG. 12 shows the sum of absolute pixel-to-pixel luminance differences for the same video clip as shown in FIG. 11.
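The two frame-difference measures can be sketched as follows. The text does not say how the measures are combined, so declaring a cut when either one exceeds its threshold is an assumption of this sketch, as are the block-averaging down-sampler, the function names, and the thresholds. Histograms are taken as precomputed lists of bin counts and luminance images as lists of rows.

```python
def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin-by-bin differences between two color histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def downsample(image, factor=2):
    """Average non-overlapping factor x factor blocks of a luminance image."""
    out = []
    for r in range(0, len(image) - factor + 1, factor):
        row = []
        for c in range(0, len(image[0]) - factor + 1, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def luminance_difference(image_a, image_b, factor=2):
    """Sum of absolute pixel-to-pixel differences after down-sampling."""
    small_a, small_b = downsample(image_a, factor), downsample(image_b, factor)
    return sum(abs(a - b)
               for row_a, row_b in zip(small_a, small_b)
               for a, b in zip(row_a, row_b))


def is_scene_cut(hist_a, hist_b, image_a, image_b,
                 hist_threshold, lum_threshold):
    """Assumed combination rule: a cut is declared if EITHER measure is large."""
    return (histogram_difference(hist_a, hist_b) >= hist_threshold
            or luminance_difference(image_a, image_b) >= lum_threshold)
```

With this rule, a camera break that leaves the color histogram almost unchanged can still be caught by the luminance measure, which is sensitive to where the brightness sits in the frame rather than only how much of each color is present.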

Even with the aforementioned technique there may be some false detections which do not correspond to a real play. Also, there are situations in which a play is broken into two segments due to, for example, dramatic lighting fluctuations (mistaken by the system as a scene cut). Some of these problems can be remedied by post-processing. One example of a suitable post-processing technique is that if two plays are separated by only a sufficiently short time duration, such as less than a predetermined time period, then they should be connected as a single play. The time period between the two detected plays may be included within the total play, if desired. Even if the two detected plays are separated by a short time period and the system puts the two plays together, and they are in fact two separate plays, this results in an acceptable segment (or two plays) because it avoids frequent audio and visual disruptions in the summary, which may be objectionable to some viewers. Another example of a suitable post-processing technique is that if a play has a sufficiently short duration, such as less than 2 seconds, then the system should remove it from being a play because it is likely a false positive. Also, post-processing may be applied to smooth the connection between adjacent plays, for both video and audio.
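The two post-processing rules might be sketched as below, with plays given as (start, end) times in seconds. The function name, the default gap of 2 seconds, and the choice to merge before discarding short plays are assumptions; the text leaves the gap as a predetermined period and does not fix the order.

```python
def post_process(plays, max_gap=2.0, min_duration=2.0):
    """Merge plays separated by less than max_gap, then drop short plays.

    plays is a list of (start, end) times in seconds. The gap between two
    merged plays is included in the resulting play.
    """
    merged = []
    for start, end in sorted(plays):
        if merged and start - merged[-1][1] < max_gap:
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    # A very short play is likely a false positive and is removed.
    return [(s, e) for s, e in merged if e - s >= min_duration]
```

Merging first means that a play split by a lighting fluctuation is reassembled before its fragments are measured, so short fragments of a genuine play are not discarded.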

Sumo video may also include gradual transitions between plays and other activities, such as commentary. These gradual transitions tend to be computationally complex to detect in the general case. However, in the case of sumo it has been determined that detecting gradual transitions based upon the color histogram differences is especially suitable. Other techniques may likewise be used. Referring to FIG. 13, the preferred technique may include starting from a start-of-play time (t_0) and looking forward until a sufficiently large scene change is detected or until time t_0+t_p is reached, whichever occurs first. t_p relates to the maximum anticipated play duration and therefore automatically sets a maximum duration to the play. This time period for processing to locate gradual transitions is denoted as t_clean_cut. If t_clean_cut<t_low then the system will not look for a gradual scene cut and sets the previously detected scene cut as the end of the play. This corresponds to an anticipated minimum time duration for a play, and t_low is used to denote the minimum time period. Otherwise, the system looks for the highest color histogram difference, or other measure of a potential scene change, in the region [t_low, t_clean_cut]. This region of the segment is from the minimum time duration to the next previously identified scene cut. This identifies the highest color histogram difference in the time duration, which may be a potential scene change. The time of the highest color histogram difference is identified as t_1. In a neighborhood of t_1, [t_1−c_1, t_1+c_2], a statistical computation is performed, such as computing the mean m_1 and the standard deviation σ_1 of the color histogram differences. c_1 and c_2 are constants or statistically calculated temporal values for the region to examine around the highest color histogram difference. A mean filtering emphasizes regions having a relatively large difference in a relatively short time interval. If the color histogram difference at t_1 exceeds m_1+c_3*σ_1, where c_3 is a constant (or otherwise), and some of its neighbors (or otherwise) are sufficiently large, then the system considers a gradual transition to have occurred at around time (frame) t_1. The end of the play is set to the earlier of the previously identified scene cut or the gradual transition, if any.
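The end-of-play search can be sketched as below. Several choices here are assumptions rather than part of the description: times are frame indices, the maximum and minimum play durations are measured as offsets from the start-of-play frame, a clean cut is any difference at or above `cut_threshold`, the search region excludes the clean-cut frame itself, the population standard deviation is used, and the optional check on neighboring frames is omitted.

```python
from statistics import mean, pstdev


def find_play_end(diffs, start, max_len, min_len, cut_threshold,
                  c1=5, c2=5, c3=2.0):
    """Return the frame index at which the play starting at start ends.

    diffs[t] is the color histogram difference at frame t. max_len and
    min_len are the maximum and minimum anticipated play durations in frames.
    """
    limit = min(start + max_len, len(diffs) - 1)
    clean_cut = limit  # fall back to the maximum play duration
    for t in range(start + 1, limit + 1):
        if diffs[t] >= cut_threshold:
            clean_cut = t
            break
    if clean_cut - start < min_len:
        return clean_cut  # too short to contain a gradual transition
    candidates = range(start + min_len, clean_cut)
    if not candidates:
        return clean_cut
    peak = max(candidates, key=lambda t: diffs[t])  # highest difference
    window = diffs[max(0, peak - c1):peak + c2 + 1]
    if diffs[peak] > mean(window) + c3 * pstdev(window):
        return peak  # gradual transition precedes the clean cut
    return clean_cut
```

Comparing the peak with the local mean and deviation, rather than with a fixed threshold, lets a modest but locally isolated rise in the difference curve count as a gradual transition while a uniformly noisy stretch does not.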

The summary obtained by the method described above contains only play segments from the original video. Even though a sumo fan may be able to quickly recognize the players after they appear, it may help a viewer to follow the tournament better if the system detects those pre-play frames that contain the players' names. An example of this type of frame is given in FIG. 14.

There are various ways of detecting overlaid graphical text content from an original image or video. In this application, the problem is one of detecting Kanji (Chinese characters used in Japanese) in images. With sufficient sample data, the system may train a convolutional neural network to perform this task. In sumo broadcasting there are a few special patterns that are typically adopted in presenting the graphical characters. For example, the names of the two players are the biggest characters. Also, it appears that the names normally appear in white (or in substantial contrast to the background). This is probably due to the fact that the names are usually overlaid on a dark scene of the sumo stadium. In addition, the graphical information is symmetric with respect to the center column, with one player's information on the left and the other player's information on the right. The characters read vertically from top to bottom.

These special patterns can be utilized to facilitate a neural network based character detection module. The system may include an algorithm to find frames with these patterns. The present inventors have found that the following set of rules may successfully detect frames with the desired player names in a video: (1) the frame has white blocks that are nearly symmetrically distributed about the center column of the image; (2) except for these white blocks, there should be no other white areas of significant size in the frame; (3) these white blocks persist for at least a few seconds; and (4) the set of frames with persistent white blocks precedes the start of a play. One or more of these rules may be included, as desired.
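
One way the four rules might be combined is sketched below. The sketch assumes each frame is available as a two-dimensional list of luminance values in the range 0 to 255; the function names, the thresholds (white_level, min_white, min_symmetry), and the approximation of rule (2) by a mirror-symmetry ratio are hypothetical choices for illustration only.

```python
def is_name_frame(frame, white_level=200, min_white=0.01, min_symmetry=0.8):
    """Rules (1) and (2): white pixels are mostly mirrored about the center column.

    White pixels without a white mirror partner lower the symmetry ratio, so
    a large asymmetric white area causes the frame to be rejected.
    """
    height, width = len(frame), len(frame[0])
    white = mirrored = 0
    for row in frame:
        for x, value in enumerate(row):
            if value >= white_level:
                white += 1
                if row[width - 1 - x] >= white_level:
                    mirrored += 1
    if white < min_white * height * width:
        return False  # too little white to be a name overlay
    return mirrored / white >= min_symmetry


def find_name_runs(flags, play_starts, min_run, max_gap):
    """Rules (3) and (4): runs of flagged frames that persist and precede a play.

    flags[i] is True when frame i passed is_name_frame. Returns (start, end)
    frame index pairs, end exclusive, for runs of at least min_run frames
    that end no more than max_gap frames before some play start.
    """
    runs = []
    start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            long_enough = i - start >= min_run
            before_play = any(0 <= p - i <= max_gap for p in play_starts)
            if long_enough and before_play:
                runs.append((start, i))
            start = None
    return runs
```

The runs returned by find_name_runs could then be prepended to the play segments that follow them.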

After the frames with the player names are detected, the system may add them to their respective plays and obtain a new summary. Unlike the baseline summary obtained before, in this new summary there are a few seconds of video like that in FIG. 14 for introducing each play. Thus the new summary is easier to follow.

If desired, a slow motion replay detection module may be incorporated. The system detects if a slow motion replay has occurred, which normally relates to important events. The system will capture the replays of plays, the same as the typical non-slow motion replay (full speed), if the same type of camera angles are used. The play segments detected may be identified with multiple characteristics, namely, slow motion replay-only segments, play-only segments without slow motion replay segments, and slow motion replays that include associated full speed segments. The resulting summary may include one or more of the different selections of the aforementioned options, as desired. For example, the resulting summary may have the slow-motion replays removed. These options may likewise be user selectable.
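
The user-selectable options could be realized with a simple filter over the detected segments. In this sketch each segment is assumed to be a (start, end, is_slow_motion) tuple, and the function name and mode names are hypothetical labels chosen for the example.

```python
def select_segments(segments, mode):
    """Filter detected play segments according to a user-selected mode.

    segments: list of (start, end, is_slow_motion) tuples.
    mode: "full_speed", "slow_motion", or "both" (assumed labels).
    """
    if mode == "both":
        return list(segments)
    if mode == "full_speed":
        return [seg for seg in segments if not seg[2]]
    if mode == "slow_motion":
        return [seg for seg in segments if seg[2]]
    raise ValueError(f"unknown mode: {mode!r}")
```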

While an effective summarization of a sumo video may be based on the concept of the “play”, sometimes the viewer may prefer an even shorter summarization with the most exciting plays included. One potential technique for the estimation of the excitement of a play is to perform statistical analysis on the segments to determine which durations are most likely to have the highest excitement. However, this technique will likely not provide sufficiently accurate results. Further, excitement tends to be a subjective measure that is hard to quantify. After further consideration the present inventors came to the realization that the audio provided together with the video provides a good indication of the excitement of the plays. For example, the volume of the response of the audience and/or the commentators provides a good indication of the excitement. The louder the audience and/or commentator acclamations, the greater the degree of excitement.

Referring to FIGS. 15A-15C, an exemplary illustration is shown of audio signals having a relatively quiet response (FIG. 15A), having a strong response (FIG. 15B), and having an extremely strong response (FIG. 15C). In general, it has been determined that more exciting plays have the following audio features. First, the mean audio volume of the play is large. The mean audio volume may be computed by defining the mean volume of a play as $v = \frac{1}{N}\sum_{i=0}^{N-1} S^{2}(i)$ where S(i) is the i-th sample and N is the total number of samples in the play. Second, the play contains more audio samples that have middle-ranged magnitudes. The second feature may be reflected by the percentage of the middle-range-magnitude samples in the play, which may be computed as $P = \frac{1}{N}\sum_{i=0}^{N-1} I\left(|S(i)| > t_1 \text{ and } |S(i)| < t_2\right)$ with I( ) being the indicator function (I(true)=1, and I(false)=0), and $t_1$ and $t_2$ being two thresholds defining the middle range.
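
The two audio features may be computed directly from the time-domain samples of a play, for example as in the following sketch. The function names are illustrative, and the samples are assumed to be a non-empty sequence of numbers.

```python
def mean_volume(samples):
    """Mean volume v: the average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def mid_range_fraction(samples, t1, t2):
    """Fraction P of samples whose magnitude lies strictly between t1 and t2."""
    return sum(1 for s in samples if t1 < abs(s) < t2) / len(samples)
```

For the samples [0.0, 0.5, -0.5, 1.0] with thresholds 0.25 and 0.75, the mean volume is 0.375 and the middle-range fraction is 0.5.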

Referring to FIG. 16, the first layer of the summary is constructed using the play detection technique. The second and third layers (and others) are extracted as being of increasingly greater excitement, based at least in part on the audio levels of the respective audio of the video segments. Also, it should be noted that the preferred audio technique only uses the temporal domain, which results in a computationally efficient technique. In addition, the level of the audio may be used as a basis for the modification of the duration of a particular play segment. For example, if a particular play segment has a high audio level then the boundaries of the play segment may be extended. This permits a greater emphasis to be placed on those segments more likely to be exciting. Conversely, if a particular play segment has a low audio level then the boundaries of the play segment may be contracted. This permits a reduced emphasis to be placed on those segments less likely to be exciting. It is to be understood that the layered summarization may be based upon other factors, as desired.
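
A layered summary and the audio-based boundary adjustment might be sketched as follows. The representation of a play as a (start, end, volume) tuple, the use of fixed volume thresholds per layer, and the fixed padding amount are assumptions made for illustration; other criteria may equally be used.

```python
def build_layers(plays, thresholds):
    """Layer 0 holds every play; layer k keeps plays with volume >= thresholds[k-1].

    plays: list of (start, end, volume) tuples.
    thresholds: ascending volume thresholds, one per additional layer.
    """
    layers = [list(plays)]
    for threshold in thresholds:
        layers.append([p for p in plays if p[2] >= threshold])
    return layers


def adjust_boundaries(play, low, high, pad):
    """Extend a loud play by pad on each side, or contract a quiet one."""
    start, end, volume = play
    if volume >= high:
        return (max(0, start - pad), end + pad, volume)
    if volume <= low and end - start > 2 * pad:
        return (start + pad, end - pad, volume)
    return play
```

Because both functions operate only on per-play volume values, they remain consistent with the temporal-domain audio processing described above.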

Referring to FIG. 17, the video summarization may be included as part of an MPEG-7 based browser/filter, where summarization is included within the standard. The media summarizer may be as shown in FIG. 1. With different levels of summarization built on top of the aforementioned video summarization technique, the system can provide the user with varying levels of summaries according to their demands. Once the summary information is described as an MPEG-7 compliant XML document, one can utilize all the offerings of MPEG-7, such as personalization, where different levels of summaries can be offered to the user on the basis of the user's preferences described in an MPEG-7 compliant way. Descriptions of user preferences in MPEG-7 include preference elements pertaining to different summary modes and detail levels.

In the case that the summarization is performed at a server or service provider, the user downloads and receives the summary description encoded in MPEG-7 format. Alternatively, in an interactive video on demand (VOD) application, the media and its summary description reside at the provider's VOD server and the user (e.g., remote) consumes the summary via a user-side browser interface. In this case, the summary may be enriched further by additional information that may be added by the service provider. Further, summarization may also be performed by the client.

Referring to FIG. 18, the output of the module that automatically detects important segments may be a set of indices of segments containing plays and important parts of the input video program. A description document, such as an MPEG-7 or TV-Anytime compliant description, is generated in the Description Generation module. Summary segments are made available to the Post-Processing module by the Extraction of Summary Segments module, which processes the input video program according to the description. A post-processing module processes the summary segments and/or the description to generate the final summary video and final description. The post-processing module puts the post-processed segments together to form the final summary video. The post-processing module may transcode the resulting video to a format different from that of the input video to meet the requirements of the storage/transmission channel. The final description may also be encoded, e.g., binarized if it is generated originally in textual format such as XML. Post-processing may include adding to the original audio track a commentary, insertion of advertisement segments, or metadata. In contrast to play detection, post-processing may be completely, or in part, manual processing. It may include, for example, automatic ranking and subset selection of events on the basis of automatic detection of features in the audio track associated with video segments. This processing may be performed at the server and then the resulting video transferred to the client, normally over a network. Alternatively, the resulting video is included in a VOD library and made available to users on a VOD server.

Referring to FIG. 19, a system may be developed that incorporates start detection of a play, end detection of a play, and summarization. The detection technique may be based upon processing a single frame, multiple frames, or a combination thereof.

The terms and expressions which have been employed in the foregoing specification are used therein as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding equivalents of the features shown and described or portions thereof, it being recognized that the scope of the invention is defined and limited only by the claims which follow.

1-17. (canceled)
18. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein the start of said plurality of segments is identified based upon a frame of said video having an upper spatial region being substantially darker than a lower spatial region of said frame, where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
19. The method of claim 18 wherein said lower spatial region comprises, at least in part, a pair of regions having a dominant color description representative of skin tone.
20. The method of claim 19 further comprising said lower spatial region comprises, at least in part, a pair of regions having a dominant color description representative of stage color.
21. The method of claim 19 further comprising said lower spatial region comprises, at least in part, a pair of regions having a dominant color description representative of stage color.
22. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein the start of said plurality of segments is identified based upon a pair of regions having a dominant color description representative of skin tone, where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
23. The method of claim 22 wherein said dominant color description includes 25 percent of said pair of regions.
24. The method of claim 22 wherein said dominant color description includes 50 percent of said pair of regions.
25. The method of claim 22 wherein said dominant color description includes 75 percent of said pair of regions.
26. The method of claim 22 wherein said pair of regions is in the lower portion of said video.
27. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein the start of said plurality of segments is identified based upon a pair of regions generally symmetric to each other with respect to a generally center column of a frame of said video, where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
28. The method of claim 27 wherein said pair of spatial regions have a dominant color description representative of skin tone.
29. The method of claim 27 wherein said center column is within 20 percent of the center of said frame.
30. The method of claim 29 wherein said center column is the center of said frame.
31. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein the start of said plurality of segments is identified based upon a pair of spatial regions that move toward one another, where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
32. The method of claim 31 wherein said pair of spatial regions have a dominant color description representative of skin tone.
33. The method of claim 31 wherein said pair of spatial regions collide with one another.
34. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, (i) wherein the start of said plurality of segments is identified based upon a frame of said video having an upper spatial region being substantially darker than a lower spatial region of said frame, (ii) wherein said lower spatial region comprises, at least in part, a pair of regions having a dominant color description representative of skin tone, (iii) wherein said lower spatial region comprises, at least in part, said pair of regions having a dominant color description representative of stage color; (iv) wherein said pair of regions are generally symmetric to each other with respect to a generally center column of a frame of said video; (v) wherein said pair of regions move toward one another; (vi) where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
35. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein said identifying for at least one of said segments includes detecting the start of said segment based upon processing of a first single frame of said video, where each of said segments includes a plurality of frames of said video; (b) verifying that said first single frame is an appropriate start of said segment based upon processing of another single frame temporally relevant to said first single frame; and (c) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
36. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said sumo video, wherein said identifying for the end of at least one of said segments is based upon detecting a scene change, where each of said segments includes a plurality of frames of said sumo video; and (b) creating a summarization of said sumo video by including said plurality of segments, where said summarization includes fewer frames than said sumo video.
37. The method of claim 36 wherein said scene change is based upon a threshold between at least two frames.
38. The method of claim 36 wherein said scene change is based upon a gradual transition below a threshold level.
39. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, where each of said segments includes a plurality of frames of said video; (b) identifying a plurality of segments that are temporally separated by a sufficiently short duration; (c) based upon said identifying as a result of (b) connecting said identified plurality of segments; and (d) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
40. The method of claim 39 wherein said connecting includes discarding the frames of said video between said identified plurality of segments.
41. The method of claim 39 wherein said connecting results in a single segment that includes said identified plurality of segments together with the frames of said video between said identified plurality of segments.
42. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, where each of said segments includes a plurality of frames of said video; (b) identifying at least one of said segments that has a temporally sufficiently short duration; (c) based upon said identifying as a result of (b) removing said identified segment from said summarization; and (d) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
43. The method of claim 42 wherein said connecting includes discarding the frames of said video between said identified plurality of segments.
44. The method of claim 42 wherein said connecting results in a single segment that includes said identified plurality of segments together with the frames of said video between said identified plurality of segments.
45. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video wherein each of said segments includes a play of sumo, wherein said segments include full-speed plays and slow motion plays of said full-speed plays; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video, where a user may select from: (i) said summarization including only full-speed plays; (ii) said summarization including only slow motion plays; (iii) said summarization including both full-speed plays and slow motion plays.
46. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video wherein each of said segments includes a play of sumo; (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video; and (c) removing at least one of said segments from said summary based, at least in part, upon audio information related to said at least one of said segments.
47. The method of claim 46 wherein said audio information is obtained exclusively from a temporal analysis.
48. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video wherein each of said segments includes a play of sumo; (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video; and (c) modifying the duration of at least one of said segments from said summary based, at least in part, upon audio information related to said at least one of said segments.
49. The method of claim 48 wherein said audio information is obtained exclusively from a temporal analysis.
50-60. (canceled)
61. A method of processing a video including sumo comprising: (a) identifying a plurality of segments of said video, wherein the detection of graphical text segments is identified based upon: (i) a pair of substantially white regions generally symmetric with respect to the center of the image, (ii) said image free from other significant substantially white areas; (iii) said white regions persist for a plurality of seconds; (iv) said white regions preceding the start of a play; (v) where each of said segments includes a plurality of frames of said video; and (b) creating a summarization of said video by including said plurality of segments, where said summarization includes fewer frames than said video.
62-64. (canceled)